Internetarchive: A Python Interface to archive.org

Internetarchive Library

Internetarchive is a Python interface to archive.org.

Usage:

>>> from internetarchive import get_item
>>> item = get_item('govlawgacode20071')
>>> item.exists
True

copyright

(C) 2012-2019 by Internet Archive.

license

AGPL 3, see LICENSE for more details.

internetarchive.Item

class Item(archive_session, identifier: str, item_metadata: Mapping | None = None)

Bases: internetarchive.item.BaseItem

This class represents an archive.org item. Generally this class should not be used directly, but rather via the internetarchive.get_item() function:

>>> from internetarchive import get_item
>>> item = get_item('stairs')
>>> print(item.metadata)

Or to modify the metadata for an item:

>>> metadata = {'title': 'The Stairs'}
>>> item.modify_metadata(metadata)
>>> print(item.metadata['title'])
The Stairs

This class also uses IA’s S3-like interface (IA-S3) to upload files to an item. You need to supply your IA-S3 credentials in order to upload. They can come from environment variables or, as in this example, be passed as keyword arguments:

>>> item.upload('myfile.tar', access_key='Y6oUrAcCEs4sK8ey',
...                           secret_key='youRSECRETKEYzZzZ')
[<Response [200]>]

You can retrieve S3 keys here: https://archive.org/account/s3.php
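Rather than hard-coding keys as in the example above, a script can read them from the environment and pass them on as keyword arguments. The following is only a minimal sketch: the helper ias3_kwargs and the variable names MY_IA_ACCESS_KEY and MY_IA_SECRET_KEY are hypothetical and are not defined by the library.

```python
import os


def ias3_kwargs(environ=None):
    """Build access_key/secret_key keyword arguments from environment variables.

    MY_IA_ACCESS_KEY and MY_IA_SECRET_KEY are hypothetical names chosen for
    this sketch; substitute whatever names your deployment defines.
    """
    environ = os.environ if environ is None else environ
    try:
        return {
            'access_key': environ['MY_IA_ACCESS_KEY'],
            'secret_key': environ['MY_IA_SECRET_KEY'],
        }
    except KeyError as exc:
        # Fail early with a readable message instead of uploading without keys.
        raise RuntimeError(f'missing environment variable: {exc.args[0]}') from None
```

The result could then be unpacked into a call such as item.upload('myfile.tar', **ias3_kwargs()).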

dark(comment: str, priority: int | str | None = None, data: Mapping | None = None, reduced_priority: bool = False, request_kwargs: Mapping | None = None) → Response

Dark the item.

Parameters
  • comment – The curation comment explaining the reason for darking the item.

  • priority – The task priority.

  • reduced_priority – Submit your task at a lower priority. This option is helpful to get around rate-limiting. Your task will more likely be accepted, but it might not run for a long time. Note that you may still be subject to rate-limiting. Unlike priority, this option may allow you to avoid rate-limiting.

  • data – Additional parameters to submit with the task.

Returns

requests.Response

derive(priority: int = 0, remove_derived: str | None = None, reduced_priority: bool = False, data: MutableMapping | None = None, headers: Mapping | None = None, request_kwargs: Mapping | None = None) → Response

Derive an item.

Parameters
  • priority – Task priority from 10 to -10 [default: 0]

  • remove_derived – You can use wildcards (“globs”) to only remove some prior derivatives. For example, “*” (typed without the quotation marks) specifies that all derivatives (in the item’s top directory) are to be rebuilt. “*.mp4” specifies that all “*.mp4” derivatives are to be rebuilt. “{*.gif,*thumbs/*.jpg}” specifies that all GIF and thumbs are to be rebuilt.

  • reduced_priority – Submit your derive at a lower priority. This option is helpful to get around rate-limiting. Your task will more likely be accepted, but it might not run for a long time. Note that you still may be subject to rate-limiting.

Returns

requests.Response

download(files: File | list[File] | None = None, formats: str | list[str] | None = None, glob_pattern: str | None = None, exclude_pattern: str | None = None, dry_run: bool = False, verbose: bool = False, ignore_existing: bool = False, checksum: bool = False, destdir: str | None = None, no_directory: bool = False, retries: int | None = None, item_index: int | None = None, ignore_errors: bool = False, on_the_fly: bool = False, return_responses: bool = False, no_change_timestamp: bool = False, ignore_history_dir: bool = False, source: str | list[str] | None = None, exclude_source: str | list[str] | None = None, stdout: bool = False, params: Mapping | None = None, timeout: int | float | tuple[int, float] | None = None) → list[Request | Response]

Download files from an item.

Parameters
  • files – Only download files matching given file names.

  • formats – Only download files matching the given formats.

  • glob_pattern – Only download files matching the given glob pattern.

  • exclude_pattern – Exclude files whose filename matches the given glob pattern.

  • dry_run – Output download URLs to stdout, don’t download anything.

  • verbose – Turn on verbose output.

  • ignore_existing – Skip files that already exist locally.

  • checksum – Skip downloading file based on checksum.

  • destdir – The directory to download files to.

  • no_directory – Download files to current working directory rather than creating an item directory.

  • retries – The number of times to retry on failed requests.

  • item_index – The index of the item for displaying progress in bulk downloads.

  • ignore_errors – Don’t fail if a single file fails to download, continue to download other files.

  • on_the_fly – Download on-the-fly files (i.e. derivative EPUB, MOBI, DAISY files).

  • return_responses – Rather than downloading files to disk, return a list of response objects.

  • no_change_timestamp – If True, leave the time stamp as the current time instead of changing it to that given in the original archive.

  • source – Only download files whose source value in files.xml matches (e.g. original, derivative, metadata).

  • exclude_source – Exclude files whose source value in files.xml matches (e.g. original, derivative, metadata).

  • params – URL parameters to send with download request (e.g. cnt=0).

  • ignore_history_dir – Do not download any files from the history dir. This param defaults to False.

Returns

True if all files have been downloaded successfully.
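The interplay of glob_pattern and exclude_pattern can be pictured with a small filter over file names. This is a sketch only, assuming shell-style matching as provided by the standard fnmatch module; select_files is a hypothetical helper, not the library's own selection code.

```python
from fnmatch import fnmatchcase


def select_files(names, glob_pattern=None, exclude_pattern=None):
    """Return the names matching glob_pattern that do not match exclude_pattern."""
    selected = []
    for name in names:
        # A missing glob_pattern means "accept everything".
        if glob_pattern is not None and not fnmatchcase(name, glob_pattern):
            continue
        # The exclusion is applied after the inclusion test.
        if exclude_pattern is not None and fnmatchcase(name, exclude_pattern):
            continue
        selected.append(name)
    return selected
```

For instance, select_files(names, '*.jpg', '*_thumb*') keeps JPEG files except those whose name contains "_thumb".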

fixer(ops: list | str | None = None, priority: int | str | None = None, reduced_priority: bool = False, data: MutableMapping | None = None, headers: Mapping | None = None, request_kwargs: Mapping | None = None) → Response

Submit a fixer task on an item.

Parameters
  • ops – The fixer operation(s) to run on the item [default: noop].

  • priority – The task priority.

  • reduced_priority – Submit your task at a lower priority. This option is helpful to get around rate-limiting. Your task will more likely be accepted, but it might not run for a long time. Note that you may still be subject to rate-limiting. Unlike priority, this option may allow you to avoid rate-limiting.

  • data – Additional parameters to submit with the task.

Returns

requests.Response

get_all_item_tasks(params: dict | None = None, request_kwargs: Mapping | None = None) → list[catalog.CatalogTask]

Get a list of all tasks for the item, pending and complete.

Parameters
  • params – Query parameters, refer to Tasks API for available parameters.

  • request_kwargs – Keyword arguments that requests.get() takes.

Returns

A list of all tasks for the item, pending and complete.

get_catalog(params: Mapping | None = None, request_kwargs: Mapping | None = None) → list[catalog.CatalogTask]

Get a list of pending catalog tasks for the item.

Parameters

params – Params to send with your request.

Returns

A list of pending catalog tasks for the item.

get_file(file_name: str, file_metadata: Mapping | None = None) → File

Get a File object for the named file.

Parameters

file_metadata – a dict of metadata for the given file.

Returns

An internetarchive.File object.

get_history(params: Mapping | None = None, request_kwargs: Mapping | None = None) → list[catalog.CatalogTask]

Get a list of completed catalog tasks for the item.

Parameters

params – Params to send with your request.

Returns

A list of completed catalog tasks for the item.

get_task_summary(params: Mapping | None = None, request_kwargs: Mapping | None = None) → dict

Get a summary of the item’s pending tasks.

Parameters

params – Params to send with your request.

Returns

A summary of the item’s pending tasks.

identifier_available() → bool

Check if the item identifier is available for creating a new item.

Returns

True if identifier is available, or False if it is not available.

modify_metadata(metadata: Mapping, target: str | None = None, append: bool = False, append_list: bool = False, insert: bool = False, priority: int = 0, access_key: str | None = None, secret_key: str | None = None, debug: bool = False, headers: Mapping | None = None, request_kwargs: Mapping | None = None, timeout: int | float | None = None) → Request | Response

Modify the metadata of an existing item on Archive.org.

Note: The Metadata Write API does not yet comply with the latest JSON-Patch standard. It currently complies with version 02.

Parameters
  • metadata – Metadata used to update the item.

  • target – Set the metadata target to update.

  • priority – Set task priority.

  • append – Append value to an existing multi-value metadata field.

  • append_list – Append values to an existing multi-value metadata field. No duplicate values will be added.

Returns

A Request if debug else a Response.
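To make the JSON-Patch note above more concrete, here is a minimal sketch of how a flat metadata update could be described as a list of patch operations. It is written in the RFC 6902 style purely for illustration; since the API follows an older revision, its wire format may differ. The helper metadata_patch is hypothetical and is not the library's implementation.

```python
def metadata_patch(current, updates):
    """Describe a flat metadata update as a list of patch operations.

    Keys missing from `current` become "add" operations, keys whose value
    changes become "replace" operations, and unchanged keys produce nothing.
    """
    patch = []
    for key, value in updates.items():
        # JSON Pointer escaping: "~" must be escaped before "/".
        path = '/' + key.replace('~', '~0').replace('/', '~1')
        if key not in current:
            patch.append({'op': 'add', 'path': path, 'value': value})
        elif current[key] != value:
            patch.append({'op': 'replace', 'path': path, 'value': value})
    return patch
```

Computing a difference like this, rather than sending the whole record, is what lets an update touch only the fields named in the metadata mapping.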

Usage:

>>> import internetarchive
>>> item = internetarchive.get_item('mapi_test_item1')
>>> md = {'new_key': 'new_value', 'foo': ['bar', 'bar2']}
>>> item.modify_metadata(md)

no_tasks_pending(params: Mapping | None = None, request_kwargs: Mapping | None = None) → bool

Check if there is any pending task for the item.

Parameters

params – Params to send with your request.

Returns

True if no tasks are pending, otherwise False.

remove_from_simplelist(parent, list) → requests.models.Response

Remove item from a simplelist.

Returns

requests.Response

undark(comment: str, priority: int | str | None = None, reduced_priority: bool = False, data: Mapping | None = None, request_kwargs: Mapping | None = None) → Response

Undark the item.

Parameters
  • comment – The curation comment explaining the reason for undarking the item.

  • priority – The task priority.

  • reduced_priority – Submit your task at a lower priority. This option is helpful to get around rate-limiting. Your task will more likely be accepted, but it might not run for a long time. Note that you may still be subject to rate-limiting. Unlike priority, this option may allow you to avoid rate-limiting.

  • data – Additional parameters to submit with the task.

Returns

requests.Response

upload(files, metadata: Mapping | None = None, headers: dict | None = None, access_key: str | None = None, secret_key: str | None = None, queue_derive=None, verbose: bool = False, verify: bool = False, checksum: bool = False, delete: bool = False, retries: int | None = None, retries_sleep: int | None = None, debug: bool = False, validate_identifier: bool = False, request_kwargs: dict | None = None, set_scanner: bool = True) → list[Request | Response]

Upload files to an item. The item will be created if it does not exist.

Parameters
  • files (str, file, list, tuple, dict) – The filepaths or file-like objects to upload.

  • **kwargs – Optional arguments that Item.upload_file() takes.

Returns

A list of requests.Response objects.

Usage:

>>> import internetarchive
>>> item = internetarchive.get_item('identifier')
>>> md = {'mediatype': 'image', 'creator': 'Jake Johnson'}
>>> item.upload('/path/to/image.jpg', metadata=md, queue_derive=False)
[<Response [200]>]

Uploading multiple files:

>>> r = item.upload(['file1.txt', 'file2.txt'])
>>> r = item.upload([fileobj, fileobj2])
>>> r = item.upload(('file1.txt', 'file2.txt'))

Uploading file objects:

>>> import io
>>> f = io.BytesIO(b'some initial binary data: \x00\x01')
>>> r = item.upload({'remote-name.txt': f})
>>> f = io.BytesIO(b'some more binary data: \x00\x01')
>>> f.name = 'remote-name.txt'
>>> r = item.upload(f)

Note: file objects must either have a name attribute or be uploaded in a dict where the key is the remote name.

Setting the remote filename with a dict:

>>> r = item.upload({'remote-name.txt': '/path/to/local/file.txt'})
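The naming rule in the note above can be sketched as a small function that works out which remote name each accepted input form would need. This is an illustration under assumptions: remote_names is a hypothetical helper, and it assumes that a bare path is stored under its base name.

```python
import os


def remote_names(files):
    """Return the remote filename for each upload input, in order.

    Hypothetical helper that only mirrors the naming rule described above.
    """
    if isinstance(files, dict):
        # Keys are the remote names; values are paths or file objects.
        return list(files)
    if not isinstance(files, (list, tuple)):
        files = [files]
    names = []
    for f in files:
        if isinstance(f, str):
            # Assumption: a path is stored under its base name.
            names.append(os.path.basename(f))
        elif getattr(f, 'name', None):
            names.append(os.path.basename(f.name))
        else:
            raise ValueError('file object has no name; upload it in a dict instead')
    return names
```

An unnamed io.BytesIO object raises ValueError here, which is the situation the dict form is meant to resolve.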

upload_file(body, key: str | None = None, metadata: Mapping | None = None, file_metadata: Mapping | None = None, headers: dict | None = None, access_key: str | None = None, secret_key: str | None = None, queue_derive: bool = False, verbose: bool = False, verify: bool = False, checksum: bool = False, delete: bool = False, retries: int | None = None, retries_sleep: int | None = None, debug: bool = False, validate_identifier: bool = False, request_kwargs: MutableMapping | None = None, set_scanner: bool = True) → Request | Response

Upload a single file to an item. The item will be created if it does not exist.

Parameters
  • body (Filepath or file-like object.) – File or data to be uploaded.

  • key – Remote filename.

  • metadata – Metadata used to create a new item.

  • file_metadata – File-level metadata to add to the files.xml entry for the file being uploaded.

  • headers – Add additional IA-S3 headers to request.

  • queue_derive – Set to False to prevent an item from being derived after upload.

  • verify – Verify local MD5 checksum matches the MD5 checksum of the file received by IAS3.

  • checksum – Skip based on checksum.

  • delete – Delete local file after the upload has been successfully verified.

  • retries – Number of times to retry the given request if S3 returns a 503 SlowDown error.

  • retries_sleep – Amount of time to sleep between retries.

  • verbose – Print progress to stdout.

  • debug – Set to True to print headers to stdout, and exit without sending the upload request.

  • validate_identifier – Set to True to validate the identifier before uploading the file.
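How retries and retries_sleep interact can be pictured as a simple retry loop. The sketch below makes no real request: send is a stand-in callable returning an HTTP status code, and the default values shown are assumptions for the example, not the library's defaults.

```python
import time


def send_with_retries(send, retries=3, retries_sleep=30, sleep=time.sleep):
    """Call send() again while it returns 503, at most `retries` more times.

    `send` is any zero-argument callable returning an HTTP status code; it
    stands in for the real upload request, which this sketch does not make.
    """
    status = send()
    attempts_left = retries
    while status == 503 and attempts_left > 0:
        sleep(retries_sleep)  # wait before asking the server again
        attempts_left -= 1
        status = send()
    return status
```

Passing the sleep function as a parameter keeps the loop easy to exercise without actually waiting.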

Usage:

>>> import internetarchive
>>> item = internetarchive.get_item('identifier')
>>> item.upload_file('/path/to/image.jpg',
...                  key='photos/image1.jpg')
<Response [200]>

internetarchive.File

class File(item, name, file_metadata=None)

Bases: internetarchive.files.BaseFile

This class represents a file in an archive.org item. You can use this class to access the file metadata:

>>> import internetarchive
>>> item = internetarchive.get_item('stairs')
>>> file = internetarchive.File(item, 'stairs.avi')
>>> print(file.format, file.size)
Cinepack 3786730

Or to download a file:

>>> file.download()
>>> file.download('fabulous_movie_of_stairs.avi')

This class also uses IA’s S3-like interface (IA-S3) to delete a file from an item. You need to supply your IA-S3 credentials in order to delete. They can come from environment variables or, as in this example, be passed as keyword arguments:

>>> file.delete(access_key='Y6oUrAcCEs4sK8ey',
...             secret_key='youRSECRETKEYzZzZ')

You can retrieve S3 keys here: https://archive.org/account/s3.php

delete(cascade_delete=None, access_key=None, secret_key=None, verbose=None, debug=None, retries=None, headers=None)

Delete a file from the Archive. Note: Some files – such as <itemname>_meta.xml – cannot be deleted.

Parameters
  • cascade_delete (bool) – (optional) Delete all files associated with the specified file, including upstream derivatives and the original.

  • access_key (str) – (optional) IA-S3 access_key to use when making the given request.

  • secret_key (str) – (optional) IA-S3 secret_key to use when making the given request.

  • verbose (bool) – (optional) Print actions to stdout.

  • debug (bool) – (optional) Set to True to print headers to stdout and exit without sending the delete request.

download(file_path=None, verbose=None, ignore_existing=None, checksum=None, destdir=None, retries=None, ignore_errors=None, fileobj=None, return_responses=None, no_change_timestamp=None, params=None, chunk_size=None, stdout=None, ors=None, timeout=None)

Download the file into the current working directory.

Parameters
  • file_path (str) – Download file to the given file_path.

  • verbose (bool) – (optional) Turn on verbose output.

  • ignore_existing (bool) – (optional) Skip the download if the file already exists locally.

  • checksum (bool) – (optional) Skip downloading file based on checksum.

  • destdir (str) – (optional) The directory to download files to.

  • retries (int) – (optional) The number of times to retry on failed requests.

  • ignore_errors (bool) – (optional) Don’t fail if a single file fails to download, continue to download other files.

  • fileobj (file-like object) – (optional) Write data to the given file-like object (e.g. sys.stdout).

  • return_responses (bool) – (optional) Rather than downloading files to disk, return a list of response objects.

  • no_change_timestamp (bool) – (optional) If True, leave the time stamp as the current time instead of changing it to that given in the original archive.

  • stdout (bool) – (optional) Print contents of file to stdout instead of downloading to file.

  • ors (bool) – (optional) Append a newline or $ORS to the end of file. This is mainly intended to be used internally with stdout.

  • params (dict) – (optional) URL parameters to send with download request (e.g. cnt=0).

Return type

bool

Returns

True if file was successfully downloaded.
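The idea behind the checksum option, skipping a download when an identical copy is already on disk, can be sketched as follows. This assumes the comparison uses an MD5 digest, as the upload documentation mentions for verify; can_skip_download is a hypothetical helper and not the library's own check.

```python
import hashlib
import os


def can_skip_download(local_path, expected_md5):
    """Return True if local_path exists and its MD5 digest equals expected_md5."""
    if not os.path.isfile(local_path):
        return False
    digest = hashlib.md5()
    with open(local_path, 'rb') as fh:
        # Read in 1 MiB chunks so large files are never loaded whole.
        for chunk in iter(lambda: fh.read(1024 * 1024), b''):
            digest.update(chunk)
    return digest.hexdigest() == expected_md5.lower()
```

A missing file never counts as a match, so the first download always goes ahead.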

internetarchive.Catalog

class Catalog(archive_session: ia_session.ArchiveSession, request_kwargs: Mapping | None = None)

Bases: object

This class represents the Archive.org catalog. You can use this class to access and submit tasks from the catalog.

This is a low-level interface, and in most cases the functions in internetarchive.api and methods in ArchiveSession should be used.

It uses the archive.org Tasks API.

Usage:

>>> from internetarchive import get_session, Catalog
>>> s = get_session()
>>> c = Catalog(s)
>>> tasks = c.get_tasks('nasa')
>>> tasks[-1].task_id
31643502

get_summary(identifier: str = '', params: dict | None = None) → dict

Get the total counts of catalog tasks meeting all criteria, organized by run status (queued, running, error, and paused).

Parameters
  • identifier – Item identifier.

  • params – Query parameters, refer to Tasks API for available parameters.

Returns

the total counts of catalog tasks meeting all criteria

get_tasks(identifier: str = '', params: dict | None = None) → list[CatalogTask]

Get a list of all tasks meeting all criteria. The list is ordered by submission time.

Parameters
  • identifier – The item identifier, if provided will return tasks for only this item filtered by other criteria provided in params.

  • params – Query parameters, refer to Tasks API for available parameters.

Returns

A list of all tasks meeting all criteria.

iter_tasks(params: MutableMapping | None = None) → Iterable[CatalogTask]

A generator that can make arbitrary requests to the Tasks API. It handles paging (via cursor) automatically.

Parameters

params – Query parameters, refer to Tasks API for available parameters.

Returns

collections.Iterable[CatalogTask]
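Cursor-based paging of the kind iter_tasks handles can be sketched as a generator that keeps requesting pages until no cursor is returned. This is a generic illustration, not the library's code: fetch_page is a hypothetical callable that takes a params dict and returns a (tasks, next_cursor) pair, and the 'cursor' parameter name is an assumption.

```python
def iter_pages(fetch_page, params=None):
    """Yield every task from a cursor-paged source.

    `fetch_page` is a hypothetical callable taking a params dict and returning
    (tasks, next_cursor); next_cursor is None on the last page.
    """
    params = dict(params or {})  # copy so the caller's dict is not modified
    while True:
        tasks, cursor = fetch_page(params)
        yield from tasks
        if cursor is None:
            break
        params['cursor'] = cursor  # ask for the page after the one just read
```

Because it is a generator, callers can stop iterating early without fetching the remaining pages.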

make_tasks_request(params: Mapping | None) → Response

Make a GET request to the Tasks API.

Parameters

params – Query parameters, refer to Tasks API for available parameters.

Returns

requests.Response

submit_task(identifier: str, cmd: str, comment: str | None = None, priority: int = 0, data: dict | None = None, headers: dict | None = None) → Response

Submit an archive.org task.

Parameters
  • identifier – Item identifier.

  • cmd – Task command to submit, see supported task commands.

  • comment – A reasonable explanation for why the task is being submitted.

  • priority – Task priority from 10 to -10 (default: 0).

  • data – Extra POST data to submit with the request. Refer to Tasks API Request Entity.

  • headers – Add additional headers to request.

Returns

requests.Response

internetarchive.ArchiveSession

class ArchiveSession(config: Mapping | None = None, config_file: str = '', debug: bool = False, http_adapter_kwargs: MutableMapping | None = None)

Bases: requests.sessions.Session

The ArchiveSession object collects together useful functionality from internetarchive as well as important data such as configuration information and credentials. It is subclassed from requests.Session.

Usage:

>>> from internetarchive import ArchiveSession
>>> s = ArchiveSession()
>>> item = s.get_item('nasa')
>>> item
Collection(identifier='nasa', exists=True)

get_item(identifier: str, item_metadata: Mapping | None = None, request_kwargs: MutableMapping | None = None)

A method for creating internetarchive.Item and internetarchive.Collection objects.

Parameters
  • identifier – A globally unique Archive.org identifier.

  • item_metadata – A metadata dict used to initialize the Item or Collection object. Metadata will automatically be retrieved from Archive.org if nothing is provided.

  • request_kwargs – Keyword arguments to be used in requests.sessions.Session.get() request.

get_metadata(identifier: str, request_kwargs: MutableMapping | None = None)

Get an item’s metadata from the Metadata API.

Parameters

identifier – Globally unique Archive.org identifier.

Returns

Metadata API response.

get_my_catalog(params: dict | None = None, request_kwargs: Mapping | None = None) → set[catalog.CatalogTask]

Get all queued or running tasks.

Parameters
  • params – Query parameters, refer to Tasks API for available parameters.

  • request_kwargs – Keyword arguments to be used in requests.sessions.Session.get() request.

Returns

A set of all queued or running tasks.

get_task_log(task_id: str | int, request_kwargs: Mapping | None = None) → str

Get a task log.

Parameters
  • task_id – The task id for the task log you’d like to fetch.

  • request_kwargs – Keyword arguments that requests.Request takes.

Returns

The task log as a string.

get_tasks(identifier: str = '', params: dict | None = None, request_kwargs: Mapping | None = None) → set[catalog.CatalogTask]

Get all tasks meeting all criteria.

Parameters
  • identifier – The item identifier, if provided will return tasks for only this item filtered by other criteria provided in params.

  • params – Query parameters, refer to Tasks API for available parameters.

  • request_kwargs – Keyword arguments to be used in requests.sessions.Session.get() request.

Returns

A set of all tasks meeting all criteria.

get_tasks_summary(identifier: str = '', params: dict | None = None, request_kwargs: Mapping | None = None) → dict

Get the total counts of catalog tasks meeting all criteria, organized by run status (queued, running, error, and paused).

Parameters
  • identifier – Item identifier.

  • params – Query parameters, refer to Tasks API for available parameters.

  • request_kwargs – Keyword arguments to be used in requests.sessions.Session.get() request.

Returns

Counts of catalog tasks meeting all criteria.
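The shape of such a summary, counts organized by run status, can be illustrated with a few lines of plain Python. This is a sketch over made-up task dictionaries: the 'status' key and the summarize helper are hypothetical and do not describe the actual API response.

```python
from collections import Counter

# The four run statuses named in the description above.
RUN_STATUSES = ('queued', 'running', 'error', 'paused')


def summarize(tasks):
    """Count tasks per run status, reporting zero for statuses with no tasks."""
    counts = Counter(task['status'] for task in tasks)
    return {status: counts.get(status, 0) for status in RUN_STATUSES}
```

Reporting explicit zeros keeps the result the same shape regardless of which statuses happen to be present.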

iter_catalog(identifier: str | None = None, params: dict | None = None, request_kwargs: Mapping | None = None) → Iterable[catalog.CatalogTask]

A generator that returns queued or running tasks.

Parameters
  • identifier – Item identifier.

  • params – Query parameters, refer to Tasks API for available parameters.

  • request_kwargs – Keyword arguments to be used in requests.sessions.Session.get() request.

Returns

An iterable of queued or running CatalogTasks.

iter_history(identifier: str | None, params: dict | None = None, request_kwargs: Mapping | None = None) → Iterable[catalog.CatalogTask]

A generator that returns completed tasks.

Parameters
  • identifier – Item identifier.

  • params – Query parameters, refer to Tasks API for available parameters.

  • request_kwargs – Keyword arguments to be used in requests.sessions.Session.get() request.

Returns

An iterable of completed CatalogTasks.

mount_http_adapter(protocol: str | None = None, max_retries: int | None = None, status_forcelist: list | None = None, host: str | None = None) → None

Mount an HTTP adapter to the ArchiveSession object.

Parameters
  • protocol – HTTP protocol to mount your adapter to (e.g. ‘https://’).

  • max_retries – The number of times to retry a failed request. This can also be an urllib3.Retry object.

  • status_forcelist – A list of status codes (as ints) to retry on.

  • host – The host to mount your adapter to.

rebuild_auth(prepared_request, response)

Never rebuild auth for archive.org URLs.

search_items(query: str, fields: Iterable[str] | None = None, sorts: Iterable[str] | None = None, params: Mapping | None = None, full_text_search: bool = False, dsl_fts: bool = False, request_kwargs: Mapping | None = None, max_retries: int | Retry | None = None) → Search

Search for items on Archive.org.

Parameters
  • query – The Archive.org search query to yield results for. Refer to https://archive.org/advancedsearch.php#raw for help formatting your query.

  • fields – The metadata fields to return in the search results.

  • params – The URL parameters to send with each request sent to the Archive.org Advancedsearch API.

  • full_text_search – Beta support for querying the archive.org Full Text Search API [default: False].

  • dsl_fts – Beta support for querying the archive.org Full Text Search API in DSL (i.e. do not prepend “!L ” to the full_text_search query) [default: False].

Returns

A Search object, yielding search results.
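Query strings for search_items are often assembled from several field constraints. The helper below is a minimal sketch, assuming the field:value syntax joined with AND that the advanced search help page describes; build_query is a hypothetical name and is not part of the library.

```python
def build_query(**fields):
    """Join field:value pairs with AND, quoting values that contain spaces."""
    terms = []
    for field, value in fields.items():
        text = str(value)
        if ' ' in text:
            # Quote multi-word values so they are treated as one phrase.
            text = '"' + text.replace('"', '\\"') + '"'
        terms.append(f'{field}:{text}')
    return ' AND '.join(terms)
```

The resulting string could be passed as the query argument, for example search_items(build_query(collection='nasa', mediatype='image')).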

send(request, **kwargs) → requests.models.Response

Send a given PreparedRequest.

Return type

requests.Response

set_file_logger(log_level: str, path: str, logger_name: str = 'internetarchive') → None

Convenience function to quickly configure any level of logging to a file.

Parameters
  • log_level – A log level as specified in the logging module.

  • path – Path to the log file. The file will be created if it doesn’t already exist.

  • logger_name – The name of the logger.

submit_task(identifier: str, cmd: str, comment: str = '', priority: int = 0, data: dict | None = None, headers: dict | None = None, reduced_priority: bool = False, request_kwargs: Mapping | None = None) → requests.Response

Submit an archive.org task.

Parameters
  • identifier – Item identifier.

  • cmd – Task command to submit, see supported task commands.

  • comment – A reasonable explanation for why the task is being submitted.

  • priority – Task priority from 10 to -10 (default: 0).

  • data – Extra POST data to submit with the request. Refer to Tasks API Request Entity.

  • headers – Add additional headers to request.

  • reduced_priority – Submit your task at a lower priority. This option is helpful to get around rate-limiting. Your task will more likely be accepted, but it might not run for a long time. Note that you may still be subject to rate-limiting. Unlike priority, this option may allow you to avoid rate-limiting.

  • request_kwargs – Keyword arguments to be used in requests.sessions.Session.post() request.

Returns

requests.Response

internetarchive.api

This module implements the Internetarchive API.

copyright

(C) 2012-2019 by Internet Archive.

license

AGPL 3, see LICENSE for more details.

configure(username: str = '', password: str = '', config_file: str = '', host: str = 'archive.org') → str

Configure internetarchive with your Archive.org credentials.

Parameters
  • username – The email address associated with your Archive.org account.

  • password – Your Archive.org password.

Returns

The config file path.

Usage:

>>> from internetarchive import configure
>>> configure('user@example.com', 'password')

delete(identifier: str, files: files.File | list[files.File] | None = None, formats: str | list[str] | None = None, glob_pattern: str | None = None, cascade_delete: bool = False, access_key: str | None = None, secret_key: str | None = None, verbose: bool = False, debug: bool = False, **kwargs) → list[requests.Request | requests.Response]

Delete files from an item. Note: Some system files, such as <itemname>_meta.xml, cannot be deleted.

Parameters
  • identifier – The globally unique Archive.org identifier for a given item.

  • files – Only return files matching the given filenames.

  • formats – Only return files matching the given formats.

  • glob_pattern – Only return files matching the given glob pattern.

  • cascade_delete – Delete all files associated with the specified file, including upstream derivatives and the original.

  • access_key – IA-S3 access_key to use when making the given request.

  • secret_key – IA-S3 secret_key to use when making the given request.

  • verbose – Print actions to stdout.

  • debug – Set to True to print headers to stdout and exit without sending the delete request.

Returns

A list of Requests if debug, else a list of Responses.

download(identifier: str, files: files.File | list[files.File] | None = None, formats: str | list[str] | None = None, glob_pattern: str | None = None, dry_run: bool = False, verbose: bool = False, ignore_existing: bool = False, checksum: bool = False, destdir: str | None = None, no_directory: bool = False, retries: int | None = None, item_index: int | None = None, ignore_errors: bool = False, on_the_fly: bool = False, return_responses: bool = False, no_change_timestamp: bool = False, timeout: int | float | tuple[int, float] | None = None, **get_item_kwargs) → list[requests.Request | requests.Response]

Download files from an item.

Parameters
  • identifier – The globally unique Archive.org identifier for a given item.

  • files – Only return files matching the given file names.

  • formats – Only return files matching the given formats.

  • glob_pattern – Only return files matching the given glob pattern.

  • dry_run – Print URLs to files to stdout rather than downloading them.

  • verbose – Turn on verbose output.

  • ignore_existing – Skip files that already exist locally.

  • checksum – Skip downloading file based on checksum.

  • destdir – The directory to download files to.

  • no_directory – Download files to current working directory rather than creating an item directory.

  • retries – The number of times to retry on failed requests.

  • item_index – The index of the item for displaying progress in bulk downloads.

  • ignore_errors – Don’t fail if a single file fails to download, continue to download other files.

  • on_the_fly – Download on-the-fly files (i.e. derivative EPUB, MOBI, DAISY files).

  • return_responses – Rather than downloading files to disk, return a list of response objects.

  • **get_item_kwargs – Optional arguments that get_item() takes.

Returns

A list of Requests if debug, else a list of Responses.

get_files(identifier: str, files: files.File | list[files.File] | None = None, formats: str | list[str] | None = None, glob_pattern: str | None = None, exclude_pattern: str | None = None, on_the_fly: bool = False, **get_item_kwargs) → list[files.File]

Get File objects from an item.

Parameters
  • identifier – The globally unique Archive.org identifier for a given item.

  • files – Only return files matching the given filenames.

  • formats – Only return files matching the given formats.

  • glob_pattern – Only return files matching the given glob pattern.

  • exclude_pattern – Exclude files matching the given glob pattern.

  • on_the_fly – Include on-the-fly files (i.e. derivative EPUB, MOBI, DAISY files).

  • **get_item_kwargs – Arguments that get_item() takes.

Returns

Files from an item.

Usage:

>>> from internetarchive import get_files
>>> fnames = [f.name for f in get_files('nasa', glob_pattern='*xml')]
>>> print(fnames)
['nasa_reviews.xml', 'nasa_meta.xml', 'nasa_files.xml']

get_item(identifier: str, config: Mapping | None = None, config_file: str | None = None, archive_session: session.ArchiveSession | None = None, debug: bool = False, http_adapter_kwargs: MutableMapping | None = None, request_kwargs: MutableMapping | None = None) → item.Item

Get an Item object.

Parameters
  • identifier – The globally unique Archive.org item identifier.

  • config – A dictionary used to configure your session.

  • config_file – A path to a config file used to configure your session.

  • archive_session – An ArchiveSession object can be provided via the archive_session parameter.

  • debug – To be passed on to get_session().

  • http_adapter_kwargs – Keyword arguments that requests.adapters.HTTPAdapter takes.

  • request_kwargs – Keyword arguments that requests.Request takes.

Returns

The Item that fits the criteria.

Usage:
>>> from internetarchive import get_item
>>> item = get_item('nasa')
>>> item.item_size
121084
get_session(config: Mapping | None = None, config_file: str | None = None, debug: bool = False, http_adapter_kwargs: MutableMapping | None = None) session.ArchiveSession[source]#

Return a new ArchiveSession object. The ArchiveSession object is the main interface to the internetarchive lib. It allows you to persist certain parameters across tasks.

Parameters
  • config – A dictionary used to configure your session.

  • config_file – A path to a config file used to configure your session.

  • debug – To be passed on to this session’s method calls.

  • http_adapter_kwargs – Keyword arguments that requests.adapters.HTTPAdapter takes.

Returns

A new ArchiveSession object.

Usage:

>>> from internetarchive import get_session
>>> config = {'s3': {'access': 'foo', 'secret': 'bar'}}
>>> s = get_session(config)
>>> s.access_key
'foo'

From the session object, you can access all of the functionality of the internetarchive lib:

>>> item = s.get_item('nasa')
>>> item.download()
nasa: ddddddd - success
>>> s.get_tasks(task_ids=31643513)[0].server
'ia311234'
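
The config dictionary shown above can be assembled from any mapping of settings rather than hard-coded. The following is a minimal sketch, assuming the credentials live under two keys; the helper build_config and the key names EXAMPLE_IA_ACCESS and EXAMPLE_IA_SECRET are hypothetical and are not names the library reads by itself.

```python
def build_config(env):
    """Build a get_session()-style config dict from a mapping of settings.

    The key names below are placeholders chosen for this example only.
    """
    access = env.get('EXAMPLE_IA_ACCESS')
    secret = env.get('EXAMPLE_IA_SECRET')
    if not access or not secret:
        # None is the documented default for the config parameter.
        return None
    # Same shape as the config used in the usage example above.
    return {'s3': {'access': access, 'secret': secret}}


config = build_config({'EXAMPLE_IA_ACCESS': 'foo', 'EXAMPLE_IA_SECRET': 'bar'})
print(config)
```

In a real script, os.environ could be passed as env, and the result handed to get_session(config).
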
get_tasks(identifier: str = '', params: dict | None = None, config: Mapping | None = None, config_file: str | None = None, archive_session: session.ArchiveSession | None = None, http_adapter_kwargs: MutableMapping | None = None, request_kwargs: MutableMapping | None = None) set[catalog.CatalogTask][source]#

Get tasks from the Archive.org catalog.

Parameters
  • identifier – The Archive.org identifier for which to retrieve tasks.

  • params – The URL parameters to send with each request sent to the Archive.org catalog API.

Returns

A set of CatalogTask objects.

get_user_info(access_key: str, secret_key: str) dict[str, str][source]#

Returns details about an Archive.org user given an IA-S3 key pair.

Parameters
  • access_key – IA-S3 access_key to use when making the given request.

  • secret_key – IA-S3 secret_key to use when making the given request.

Returns

Archive.org user info.

get_username(access_key: str, secret_key: str) str[source]#

Returns an Archive.org username given an IA-S3 key pair.

Parameters
  • access_key – IA-S3 access_key to use when making the given request.

  • secret_key – IA-S3 secret_key to use when making the given request.

Returns

The username.

modify_metadata(identifier: str, metadata: Mapping, target: str | None = None, append: bool = False, append_list: bool = False, priority: int = 0, access_key: str | None = None, secret_key: str | None = None, debug: bool = False, request_kwargs: Mapping | None = None, **get_item_kwargs) requests.Request | requests.Response[source]#

Modify the metadata of an existing item on Archive.org.

Parameters
  • identifier – The globally unique Archive.org identifier for a given item.

  • metadata – Metadata used to update the item.

  • target – The metadata target to update. Defaults to metadata.

  • append – Set to True to append metadata values to current values rather than replacing them. Defaults to False.

  • append_list – Append values to an existing multi-value metadata field. No duplicate values will be added.

  • priority – Set task priority.

  • access_key – IA-S3 access_key to use when making the given request.

  • secret_key – IA-S3 secret_key to use when making the given request.

  • debug – Set to True to return a requests.Request object instead of sending the request. Defaults to False.

  • **get_item_kwargs – Arguments that get_item takes.

Returns

A Request if debug, else a Response.
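
The append_list behaviour described above (append to a multi-value field, never adding duplicates) can be modelled locally. This is only a sketch of the documented semantics; the helper append_unique is hypothetical, it does not contact Archive.org, and the library's actual merge logic may handle edge cases differently.

```python
def append_unique(current, new_values):
    """Return current values followed by any new values not already present."""
    # Normalise the existing field to a list (it may be missing or a single string).
    if current is None:
        merged = []
    elif isinstance(current, str):
        merged = [current]
    else:
        merged = list(current)
    # Accept a single string or an iterable of strings for the new values.
    if isinstance(new_values, str):
        new_values = [new_values]
    for value in new_values:
        if value not in merged:
            merged.append(value)
    return merged


subjects = append_unique(['space', 'nasa'], ['nasa', 'apollo'])
print(subjects)
```
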

search_items(query: str, fields: Iterable | None = None, sorts=None, params: Mapping | None = None, full_text_search: bool = False, dsl_fts: bool = False, archive_session: session.ArchiveSession | None = None, config: Mapping | None = None, config_file: str | None = None, http_adapter_kwargs: MutableMapping | None = None, request_kwargs: Mapping | None = None, max_retries: int | Retry | None = None) search.Search[source]#

Search for items on Archive.org.

Parameters
  • query – The Archive.org search query to yield results for. Refer to https://archive.org/advancedsearch.php#raw for help formatting your query.

  • fields – The metadata fields to return in the search results.

  • params – The URL parameters to send with each request sent to the Archive.org Advancedsearch API.

  • full_text_search – Beta support for querying the archive.org Full Text Search API [default: False].

  • dsl_fts – Beta support for querying the archive.org Full Text Search API in DSL (i.e. do not prepend !L to the full_text_search query) [default: False].

  • config – Configuration options for session.

  • config_file – A path to a config file used to configure your session.

  • http_adapter_kwargs – Keyword arguments that requests.adapters.HTTPAdapter takes.

  • request_kwargs – Keyword arguments that requests.Request takes.

  • max_retries

    The number of times to retry a failed request. This can also be a urllib3.Retry object. If you need more control (e.g. status_forcelist), use an ArchiveSession object, and mount your own adapter after the session object has been initialized. For example:

    >>> s = get_session()
    >>> s.mount_http_adapter()
    >>> search_results = s.search_items('nasa')
    

    See ArchiveSession.mount_http_adapter() for more details.

Returns

A Search object, yielding search results.
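
Because the returned object yields results lazily, it is often convenient to cap how many are consumed. The sketch below works on any iterable of result dictionaries; it assumes each result carries an 'identifier' key, and the helper first_identifiers and the sample data are hypothetical stand-ins rather than real search output.

```python
from itertools import islice


def first_identifiers(results, limit=10):
    """Collect the 'identifier' value from at most `limit` results."""
    # islice stops pulling from the iterable once `limit` results are consumed.
    return [r['identifier'] for r in islice(results, limit) if 'identifier' in r]


# Stand-in for the results a Search object would yield (assumed shape).
sample_results = iter([
    {'identifier': 'nasa'},
    {'identifier': 'stairs'},
    {'identifier': 'govlawgacode20071'},
])
top_two = first_identifiers(sample_results, limit=2)
print(top_two)
```

With the library installed, the return value of search_items() could be passed in place of sample_results.
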

upload(identifier: str, files, metadata: Mapping | None = None, headers: dict | None = None, access_key: str | None = None, secret_key: str | None = None, queue_derive=None, verbose: bool = False, verify: bool = False, checksum: bool = False, delete: bool = False, retries: int | None = None, retries_sleep: int | None = None, debug: bool = False, validate_identifier: bool = False, request_kwargs: dict | None = None, **get_item_kwargs) list[requests.Request | requests.Response][source]#

Upload files to an item. The item will be created if it does not exist.

Parameters
  • identifier – The globally unique Archive.org identifier for a given item.

  • files – The filepaths or file-like objects to upload. This value can be an iterable or a single file-like object or string.

  • metadata – Metadata used to create a new item. If the item already exists, the metadata will not be updated – use modify_metadata.

  • headers – Add additional HTTP headers to the request.

  • access_key – IA-S3 access_key to use when making the given request.

  • secret_key – IA-S3 secret_key to use when making the given request.

  • queue_derive – Set to False to prevent an item from being derived after upload.

  • verbose – Display upload progress.

  • verify – Verify local MD5 checksum matches the MD5 checksum of the file received by IAS3.

  • checksum – Skip uploading files based on checksum.

  • delete – Delete local file after the upload has been successfully verified.

  • retries – Number of times to retry the given request if S3 returns a 503 SlowDown error.

  • retries_sleep – Amount of time to sleep between retries.

  • debug – Set to True to print headers to stdout, and exit without sending the upload request.

  • validate_identifier – Set to True to validate the identifier before uploading the file.

  • **get_item_kwargs – Optional arguments that get_item takes.

Returns

A list of Requests if debug, else a list of Responses.
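
The verify and checksum options both rest on comparing MD5 digests. A minimal local sketch of that comparison, using only hashlib, follows; the helpers md5_of_file and needs_upload are hypothetical, and how the library obtains the remote digest is not shown here.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, 'rb') as fh:
        # Read fixed-size chunks so large files are never loaded whole.
        for chunk in iter(lambda: fh.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def needs_upload(path, remote_md5):
    """Return True when no remote digest is known or the digests differ."""
    return remote_md5 is None or md5_of_file(path) != remote_md5
```

A caller could use needs_upload() to decide whether a file is worth sending again, which is the idea behind skipping uploads based on checksum.
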